ABSTRACT


In this paper we present the results of the Interactive Argument-Pair Extraction in Judgement Document 
Challenge held by both the Chinese AI and Law Challenge (CAIL) and the Chinese National Social Media 
Processing Conference (SMP), and introduce the related data set — SMP-CAIL2020-Argmine. The task 
challenged participants to choose the correct argument among five candidates proposed by the defense to 
refute or acknowledge the given argument made by the plaintiff, providing the full context recorded in the 
judgement documents of both parties. We received entries from 63 competing teams, 38 of which scored 
higher than the provided baseline model (BERT) in the first phase and entered the second phase. The best 
performing system in the two phases achieved accuracy of 0.856 and 0.905, respectively. In this paper, we 
will present the results of the competition and a summary of the systems, highlighting commonalities and 
innovations among participating systems. The SMP-CAIL2020-Argmine data set and baseline models① have
already been released.


* Corresponding author: Zhongyu Wei (Email: zywei@fudan.edu.cn; ORCID: 0000-0003-3789-8507).
① https://www.disc.fudan.edu.cn/data/SMP-CAIL2020-Argmine

1. INTRODUCTION


In a trial, the opinions, testimonies and results of both sides of the case are all recorded in detail
in the judgement document [1], an example of which is shown in Figure 1. Traditionally, such text is
organized, summarized and analyzed manually by the judge, which is time-consuming and inefficient.
In recent years, with the increasing interest in automatic analysis in the judicial field [2, 3, 4],
more and more attention has been paid to automatic systems for the judicial process, from Ulmer's
proposal of quantitative methods and probability theory [5] and Nagel's [6] optimization and
statistical methods, to Liu & Chen's [7], Sulea et al.'s [8] and Katz et al.'s [9] natural language
processing (NLP) models leveraging more lexical features in judicial documents, which indicates that
such a task is much needed and of practical value.


Another research area of interest is argumentation mining, since arguments play an increasingly
important role in decision making on social issues. As an automatic technique to process and analyze
arguments, computational argumentation, aimed at mining the semantic and logical structure of a given
text, has become a rapidly growing field in natural language processing. Existing research on argumentation
mining covers argument structure prediction [10, 11, 12], claim generation [13-17], and interactive
argument pair identification [18-24]. Recently, Cheng et al. [25] extracted argument pairs from peer review
and rebuttal data in order to study their content, their structure and the connections between them.



Figure 1. An instance of a judgement document, which contains the statements of the defense and the plaintiff, the
judgement date, the result of the trial, the judges' names, and the recorder's name.


In the works mentioned above, an interactive argument pair refers to two arguments that interact with
each other logically or semantically, e.g., “Global warming does not affect our daily life as the
scientists say.” and “I cannot imagine what my life would be if my homeland is beneath the sea level.”
The two arguments talk about the same topic, global warming in this example, and the second one
responds to the first by hypothesizing a scene of global warming.


During a trial, both parties have to make their own points clear and respond to the opposite party,
which resembles a debate to a large extent, so it is intuitive yet promising to apply computational
argumentation methods to this field. A typical task of this kind is to automatically extract the focus
of dispute of the two parties in a trial. Specifically, the focus of dispute between the plaintiff and
the defense refers to the arguments that the two sides propose on fact statement or claim settlement,
either consistent with each other or attacking each other, an example of which is shown in Figure 2.
This is essentially the same setting as interactive argument pair extraction. Such a task is therefore
of high practical value: with an automatic system that extracts these focuses of dispute, the judge
can be freed from reading, comprehending, and analyzing the lengthy judgement documents manually,
which improves the efficiency and objectivity of the whole trial process.



Figure 2. An example of three pairs of focus of dispute in one judgement document. Note: Each pair contains a 
sentence (i.e., argument) from the plaintiff and the defense, respectively. Among the three pairs, two of them are 
of Denying relationship and the other is of Partially Acknowledging relationship. 


In order to address the aforementioned task, we hosted the Interactive Argument-Pair Extraction in
Judgement Document (SMP-CAIL2020-Argmine) Challenge. We constructed a purpose-built data set that
contains 4,080 argument pairs from 976 judgement documents collected from
http://wenshu.court.gov.cn/, published by the Supreme People's Court of China.


All the argument pairs were manually annotated by undergraduates and graduates majoring in law. Each
argument pair consists of one argument from the plaintiff and one from the defense that interact
with each other logically or semantically. During annotation, annotators were given the full
context of both sides and were then required to extract all the interactive arguments between the
plaintiff and the defense. Note that there can be multiple arguments from the defense that interact
with the same argument from the plaintiff, and vice versa.


The task setting follows the one designed in Ji et al.'s work [23]. The systems participating in
the SMP-CAIL2020-Argmine Challenge were required to identify, among five candidate arguments, the
correct argument from the defense that interacts with the given argument from the plaintiff. That is
to say, every entry of the collected argument pairs is converted into a multiple-choice problem with
four false options. Performing well in the task therefore requires the system to deeply understand the
semantic relationship between the given argument from the plaintiff and the candidate arguments. We
conducted the competition in two phases, setting an accuracy threshold in the first phase so that only
those teams whose systems outperformed the baseline models we provided could enter the second phase.
The number of argument pairs reaches 4,080, including the training set and the test sets of the two
phases. In total, 315 teams from over 100 colleges and enterprises registered for the competition, 63
of which successfully submitted their models. We hope that research and practice in these fields will
be stimulated by the challenges presented in this competition.


In this paper, we present a detailed description of the task and the data set, along with a summary of 
the submissions, and discuss the possible future research directions of the task. 


2. RELATED WORK 
2.1 Automatic Analysis of Judicial Documents 


Automatic analysis of judicial documents has been studied for decades. In the earliest stage, research
tended to focus on mathematical and statistical analyses of existing court cases, rather than on
conclusions or methodologies for the prediction or summarisation of judicial documents. Ulmer suggested
some uses of quantitative methods and probability theory in analyzing judicial materials [5]. Similar
work, including Nagel's [6] and Kort's [26], typically used optimization and statistics to conduct
automatic judgement prediction. More recently, Lauderdale applied a kernel-weighted optimal
classification estimator to recover estimates of judicial preferences [27].


Recent years have witnessed a boom in natural language processing (NLP), both theoretically and
practically. As a natural application scenario of NLP, automation in the judicial field is also becoming
increasingly popular among NLP researchers, and the automatic analysis of judicial documents has
entered a new era. Liu and Chen [7] and Sulea et al. [8] extracted word features such as N-grams
to train classifiers to predict the result of a judgement, while Katz et al. [9] utilized case profile
information (e.g., dates, terms, locations and case types). Going further, Luo et al. introduced an
attention-based neural model to predict charges of criminal cases, and verified the effectiveness of
taking law articles into consideration [28].


Besides the automatic systems, a great number of interesting and meaningful tasks have also been
proposed. For example, Xiao et al. [29] proposed a large-scale legal data set for judgement prediction,
collected from China Judgments Online②, and then organized a competition for this task [30]. After that,
more judicial tasks and challenges were proposed, such as those of Xiao et al. [31] and Liu et al. [32].


However, existing research mostly focuses on the case-level information understanding, such as applicable 
law articles, charges, and prison terms [29, 30], and insufficient research has noticed the importance of 
automatically extracting the focus of dispute, i.e., the interactive arguments from both sides of the case. 


② http://wenshu.court.gov.cn/

2.2 Argumentation Mining


Argumentation mining is another research area that has attracted growing attention, especially in
recent years. As a field that mines the logical and semantic structure of texts, it has seen various
meaningful works. For instance, Baff et al. [33] compared content- and style-oriented classifiers on
editorials from the liberal New York Times with ideology-specific effect annotations to explore the
effect of the writing style of editorials on audiences of different ideologies; Ji et al. [23] proposed
the task of identifying interactive argument pairs in online debate forums such as ChangeMyView (CMV),
along with a novel representation learning method called Discrete Variational Autoencoder (DVAE) to
encode different dimensions of information carried by the arguments in the corpus; Cheng et al. [25]
collected text data from the peer review and rebuttal process to mine the argumentative relationships
in such discussion, and proposed a challenging data set for argument pair extraction together with a
multi-task learning framework to address the task.


Also, the introduction of pretrained language models such as BERT [34] has opened a new era of NLP,
with markedly improved performance in nearly all tasks.


The trial process resembles a debate in many ways, since in both cases two parties express their own
opinions on the same topic and attack each other's arguments. Therefore, it is practical to leverage
models and methods from argumentation mining in the aforementioned judicial tasks.


3. DATA SET CONSTRUCTION 


As discussed before, our goal is to construct an automatic system such that it can identify all the 
interactive argument pairs contained in the given judgement document which records the statement of both 
the plaintiff and the defense. Therefore, we collect the related data set from the judgement document 
corpus. 


3.1 Data Source and Preprocessing 


The raw data of judgement are provided by China Justice Big Data Institute, including over 10,000 entries 
in JSON format. 


We first conducted random sampling on the raw data set and found that some documents were of low
quality. More specifically, the statement from the defense in some documents was trivial, containing
only an acknowledgement of all the statements made by the plaintiff; the interaction of the two sides
in some documents focused only on the amount of charge, without any semantically or logically
interactive arguments; and some documents contained too few or too many sentences to be analyzed.

In order to solve these problems, we refined the data set with the following rules:


• Delete all the entries that contain the Chinese expression for “forfeiting” or for “having no
opposite opinions” in the first sentence of the defense's statement, since very few of these entries
refute the statement of the plaintiff.

• Delete all the entries that contain fewer than two non-charging sentences in either the statement of
the plaintiff or that of the defense (a “non-charging sentence” is a sentence that does not contain
figures), as we do not want the focus of dispute to concern only the amount of charge.

• Delete all the entries that contain fewer than four sentences in either the statement of the plaintiff
or that of the defense, and all the entries that contain more than 1,500 words in the statements of
both sides, so as to control the length of the documents and thus improve the quality of the data set.
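
The three filtering rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the organizers' released code: each side is assumed to be a list of sentences, the two marker strings are hypothetical placeholders for the Chinese expressions, a "figure" is assumed to mean any digit, and the 1,500-word limit is assumed to apply to the combined character count of both statements.

```python
import re

# Hypothetical placeholders for the two Chinese expressions in the first rule.
SKIP_MARKERS = ("<forfeiting>", "<no opposite opinions>")

MIN_SENTENCES = 4      # third rule: at least four sentences per side
MIN_NON_CHARGING = 2   # second rule: at least two sentences without figures
MAX_LENGTH = 1500      # third rule: upper bound on the combined length


def is_non_charging(sentence):
    """Assumed test: a sentence is non-charging when it contains no digit."""
    return re.search(r"\d", sentence) is None


def keep_entry(plaintiff, defense):
    """Return True when a document passes all three filtering rules.

    `plaintiff` and `defense` are lists of sentences (assumed layout).
    """
    if not plaintiff or not defense:
        return False
    # Rule 1: the first defense sentence forfeits or raises no objection.
    if any(marker in defense[0] for marker in SKIP_MARKERS):
        return False
    for side in (plaintiff, defense):
        # Rule 2: too few sentences that are not about amounts.
        if sum(is_non_charging(s) for s in side) < MIN_NON_CHARGING:
            return False
        # Rule 3, lower bound: too few sentences overall.
        if len(side) < MIN_SENTENCES:
            return False
    # Rule 3, upper bound: length approximated by character count (assumption).
    total_length = sum(len(s) for s in plaintiff + defense)
    return total_length <= MAX_LENGTH
```

Keeping the thresholds as named constants makes it easy to check how sensitive the size of the filtered corpus is to each rule.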


After such filtering, we finally obtained 2,238 instances of judgement documents that are of high quality. 
Then we randomly sampled 40 of the obtained judgement documents and asked four graduate students to 
conduct human annotation of interactive argument pairs extraction. As a result, 120.25 argument pairs were 
extracted per person, and the average agreement was 0.628, which indicates that the task is both plausible 
and challenging. 


3.2 Annotation 


After preprocessing the raw data, we started the annotation of the data set. The platform used for
annotation is shown in Figure 3; it displays the sentences in the judgement documents and saves the
annotation results to a database on the server.


We then employed six annotators who were undergraduates or graduates majoring in law, for more
professional annotation. Each judgement document was annotated by two different annotators, in order to
reduce accidental errors.


As shown in Figure 3, during annotation, the annotators were given the whole statements of both the
plaintiff and the defense, with each sentence ordered and numbered. Their task was two-fold:


• Annotating features of the case. For the given case, annotators were required to specify some basic
features of the whole case, including the case type, the type of the crime involved, as well as the
entities of the plaintiff and the defense.

• Identifying all the interactive argument pairs in both sides' statements. The annotators were then
required to identify all the interactive argument pairs in the given case. Note that the number
of such pairs was not constant, so the annotators had to record all the interactive argument pairs by
adding them one by one. Furthermore, we classified the argument pairs into four emotional categories:
acknowledging, partially acknowledging, simple denying and active denying.

Figure 3. The online platform we used in the annotation, displaying the sentences in the judgement documents 
and saving the annotation results to the local server. 


Note that in the second task, besides identifying interactive argument pairs, the annotators were
also required to classify each argument pair collected. The four categories mentioned above represent
different emotional polarities of the defense. Specifically, acknowledging pairs generally refer to
those in which the defense simply offers arguments like “I confess.”; partially acknowledging means
that the defense's argument acknowledges some parts of the plaintiff's but denies the others; simple
denying covers simple and direct denial such as “I did not hit the plaintiff.”; and active denying
is more complicated, sometimes including a completely opposite statement on the same topic, e.g.,
“I did not hit the plaintiff, and instead, the plaintiff hit me with an umbrella.” We conducted the
classification to make it more convenient for the judge to know which argument pairs need further
judgement and evidence. With these annotation standards, an instance of annotation is shown in Figure 4.



Figure 4. The annotation result of No.12 judgement document, containing four interactive argument pairs 
extracted by the annotators. 


3.3 Statistics on the Data Set 


After six months of annotation, some basic statistics on the data set are shown in Table 1 below. From
the table we can see that law major students indeed achieved higher agreement (0.960, against 0.628 in
the pilot annotation of Section 3.1), indicating that professional knowledge helps improve the
performance in this task. Another notable point is that interactive argument pairs, compared with all
the sentence pairs in the corpus, are of very low density, which brings challenges for automation.

Table 1. Basic statistics on the annotated data set.


Data set                                               Number
Annotated judgement documents                          1,069
Annotated interactive argument pairs                   4,476
Agreeable argument pairs                               1,027
Disagreeable argument pairs                            3,158
Sentence pairs in the annotated judgement documents    78,943
Average interactive argument pair density              0.058
Average agreement among annotators                     0.960


4. TASK DESCRIPTION 
4.1 Task Formulation 


As mentioned above, the density of interactive argument pairs is very low (compared with all the sentence
pairs between the two sides), and thus we convert the identification task into an easier one. Our
approach is to construct a multiple-choice problem for every argument from the plaintiff that occurs in at
least one interactive pair, by adding four arguments from the defense that do not match the plaintiff's
argument. That is to say, given an argument $sc$ from the plaintiff, a candidate set of the defense's
arguments consists of one positive reply $bc^{*}$ and four negative arguments $bc_1, \dots, bc_4$, along
with their corresponding contexts, and our goal is to automatically identify which argument from the
defense interacts with the one from the plaintiff.
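
To make the construction concrete, the following Python sketch builds one multiple-choice problem from an annotated document. It is a minimal illustration, not the procedure actually used to build the data set: the function name `build_problem`, the argument layout, and the uniform random sampling of negatives are all assumptions.

```python
import random


def build_problem(plaintiff_arg, positive, defense_sentences, interacting, rng,
                  n_negatives=4):
    """Build one 5-way multiple-choice problem (hypothetical helper).

    plaintiff_arg:     the plaintiff's argument sc
    positive:          the defense argument that interacts with sc
    defense_sentences: all sentences of the defense's statement
    interacting:       set of defense sentences annotated as interacting with sc
    Returns None when the defense has too few non-interacting sentences.
    """
    # Negatives must not interact with sc, otherwise the problem is ambiguous.
    pool = [s for s in defense_sentences if s not in interacting]
    if len(pool) < n_negatives:
        return None
    candidates = rng.sample(pool, n_negatives) + [positive]
    rng.shuffle(candidates)  # the answer position should carry no signal
    return {
        "sc": plaintiff_arg,
        "candidates": candidates,
        "answer": candidates.index(positive) + 1,  # 1-based index
    }
```

Returning `None` for short defense statements mirrors the fact that a document can only yield a problem when enough non-interacting defense sentences are available.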


We formulated such a task as a 5-way multiple-choice problem. In practice, the participants' models
calculated the matching score $S(sc, bc)$ between the plaintiff's argument $sc$ and each argument $bc$ in
the candidate set, and treated the candidate with the highest matching score as the answer. Note that
here we did not use the emotional tags we collected before, since we would like to focus mainly on the
identification of the correct argument pair in this competition.
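
The decision rule described above amounts to an argmax over the five matching scores. A minimal Python sketch follows; `score_fn` stands for whatever scoring model a participant trained and is a hypothetical parameter, not part of any released code.

```python
def predict(sc, candidates, score_fn):
    """Return the 1-based index of the candidate with the highest score S(sc, bc)."""
    scores = [score_fn(sc, bc) for bc in candidates]
    # max() returns the first maximum, so ties go to the earliest candidate.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best + 1
```

Any model that maps a pair of arguments to a real number can be plugged in as `score_fn`, which is why sentence-pair classifiers and ranking models are interchangeable at prediction time.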


Naturally, this setting requires the statement of the defense to contain no fewer than five sentences
(or more if more than one argument from the defense interacts with the plaintiff's), so some of the
entries were discarded, and our final data set comprises 4,080 interactive argument pairs (i.e.,
multiple-choice problems) from 976 judgement documents. An example is displayed in Table 2 below.


Table 2. An example of the multiple-choice task.


Statement                        Sentence
Full context of the plaintiff    [Chinese text]
Full context of the defense      [Chinese text]
The plaintiff's argument         [Chinese text]
Candidate argument 1             [Chinese text]
Candidate argument 2             [Chinese text]
Candidate argument 3             [Chinese text]
Candidate argument 4             [Chinese text]
Candidate argument 5             [Chinese text]
Answer                           1

4.2 Scoring Metric and Data Set Division


For the released multiple-choice task, we take accuracy as the evaluation metric. Specifically, if the
ground truth of the $i$th problem is $y_i$ and the system predicts the answer to be $\hat{y}_i$, then the
average accuracy on the test data set of size $n$ is calculated as below:


$$\mathrm{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i = \hat{y}_i) \qquad (1)$$


To test the systems' generalization more fairly, we organized the competition in two phases and
divided the data set into three parts, namely SMP-CAIL2020-Argmine_train,
SMP-CAIL2020-Argmine_test1, and SMP-CAIL2020-Argmine_test2, with sizes in a ratio of
roughly 3:1:1.


In the first phase of the competition, participants were provided with the SMP-CAIL2020-Argmine_train
data set to train their systems, and were tested on the SMP-CAIL2020-Argmine_test1 data set. Those who
exceeded the performance of the given BERT baseline model were admitted to the second phase. In
the second phase, participants were provided with the SMP-CAIL2020-Argmine_test1 data set and tested
on the SMP-CAIL2020-Argmine_test2 data set. The participants' final score is
$0.3 \times \mathrm{Score}_1 + 0.7 \times \mathrm{Score}_2$, in which $\mathrm{Score}_1$ and
$\mathrm{Score}_2$ are their scores in the first and second phases, respectively.
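
The accuracy metric and the weighted final score can be written down directly. The sketch below is a plain-Python illustration; the function names are chosen here for readability and are not taken from the competition's scoring code.

```python
def accuracy(gold, predicted):
    """Fraction of problems whose predicted answer equals the ground truth."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def final_score(score_1, score_2):
    """Weighted sum of the phase-one and phase-two accuracies."""
    return 0.3 * score_1 + 0.7 * score_2
```

For instance, phase scores of 0.852 and 0.896 give a final score of 0.8828 after rounding to four decimals, which is the figure reported for the top team in Table 4.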


4.3 Baseline Models 


Before we released the competition, we ran the following baseline models on the data set to obtain the
threshold for admission to the second phase. Note that for every baseline model, we only took the
SMP-CAIL2020-Argmine_train data set as the training set.


• All 1
This model directly output answer “1”, which was used to examine whether the distribution of the
answers was shuffled randomly enough.

• Common Words
This model returned the candidate argument that had the most words in common with the given argument
from the plaintiff, which was a simple and straightforward model leveraging lexical features.

• BiLSTM
This model first conducted word segmentation using Jieba [35], and then concatenated the plaintiff's
argument with each candidate argument separately. In this way, we converted the 5-way multiple-choice
problem into five sentence-pair classification problems. Then we randomly discarded three negative
sentence pairs so as to balance the two classes. For each sentence pair, the embedding was fed
sequentially into a BiLSTM [36, 37], whose final hidden state was passed to a linear classifier to
output the final prediction. Figure 5(a) shows the model's overall framework.

• BERT
BERT [34] is a pretrained language model based on transformers, and has proved to be highly effective
in many areas of NLP. In our experiment, we also converted the problem into sentence-pair
classification, since it is much easier to apply the BERT model to such a problem.
Figure 5(b) shows the model's overall framework.
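
As an illustration of the simplest lexical baseline in the list above, the Common Words model can be sketched in a few lines of Python. The tokenizer is an assumption: whitespace splitting is used here so that the example runs without extra dependencies, whereas for Chinese text a word segmenter (for example `jieba.lcut`) would be passed as `tokenize`. The tie-breaking rule is also an assumption.

```python
def common_words_baseline(sc, candidates, tokenize=str.split):
    """Return the 1-based index of the candidate sharing the most words with sc."""
    sc_tokens = set(tokenize(sc))
    overlaps = [len(sc_tokens & set(tokenize(bc))) for bc in candidates]
    # index() returns the first maximum, so ties go to the earliest candidate.
    return overlaps.index(max(overlaps)) + 1
```

Because the score is a pure set overlap, this baseline cannot distinguish a denial from an acknowledgement that reuses the same words, which is one reason the neural baselines do better.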




Figure 5. The overall framework of two neural network baseline models, in which (a) refers to the BiLSTM model 
and (b) refers to the BERT model. 
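
The conversion from a 5-way problem to balanced sentence-pair training examples, as described for the BiLSTM baseline, can be sketched as follows. This is an assumption-laden illustration: the tuple layout `(sc, bc, label)` and the helper name are hypothetical, and at test time all five pairs would be scored rather than sampled.

```python
import random


def to_balanced_pairs(sc, candidates, answer, rng, keep_negatives=1):
    """Turn one 5-way problem into labelled sentence pairs for training.

    Keeps the positive pair and `keep_negatives` randomly chosen negative
    pairs, i.e. three of the four negatives are dropped by default.
    `answer` is the 1-based index of the correct candidate.
    """
    positive = (sc, candidates[answer - 1], 1)
    negatives = [(sc, bc, 0)
                 for i, bc in enumerate(candidates, start=1) if i != answer]
    return [positive] + rng.sample(negatives, keep_negatives)
```

Dropping negatives trades training data for class balance; the oversampling strategy described in Section 5.2.2 is the alternative that keeps every negative.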


All baseline models' performance is shown in Table 3 below. Since the best baseline model achieves an
accuracy of 0.7476 on the test1 set, we set the threshold of the first phase at 0.75.


Table 3. Performance of all baseline models.


Model name      Train accuracy    Test1 accuracy    Test2 accuracy
All 1           0.2009            0.1890            0.1922
Common Words    0.4904            0.4908            0.5275
BiLSTM          0.8742            0.6270            0.6793
BERT            0.8812            0.7476            0.7797


4.4 Submissions 


The SMP-CAIL2020-Argmine Challenge was hosted on CAIL③, which allowed submissions to be scored
against the blind test set without the need to publish the correct labels. The two phases of the scoring
system were open from June 1 to July 9, and from July 10 to August 3, 2020. Participants were limited
to 3 submissions per week.


③ http://cail.cipsc.org.cn

5. COMPETITION DETAILS
5.1 Participants and Results 


Over 300 teams from various universities and enterprises registered for SMP-CAIL2020-Argmine, 63 teams
submitted their models in the first phase, and 21 teams submitted their final models. The final
accuracy shows that neural models can achieve considerable results on the task, especially when given
a larger training set. In Table 4, we list the scores of the top 7 participants of the task. We have
collected the technical reports of these contestants, and in the following parts we summarize their
methods and tricks according to these reports. The performance of all participants on
SMP-CAIL2020-Argmine can be found in Appendix A.


Table 4. Performance of participants on SMP-CAIL2020-Argmine.


Team                  Score_1    Score_2    Final score
zero point            0.852      0.896      0.8828
a-U                   0.816      0.901      0.8755
quanshuizhihuiguan    0.802      0.905      0.8741
i                     0.811      0.886      0.8635
tiaodalanmao          0.800      0.857      0.8399
wf                    0.788      0.853      0.8335
zhihuizhengfa         0.788      0.853      0.8335


5.2 The Submitted Models 
5.2.1 General Architecture 


Pretrained Language Model. Since BERT [34] was proposed, the whole NLP area has been pushed into a
new era, with almost all tasks improved in performance. Among the baseline models above, BERT also
gives the best performance on the task, which makes pretrained language models such as
Sentence-BERT [38], RoBERTa [39], and ERNIE [40] popular in submissions.


Fine-tuning Mechanisms. After leveraging the pretrained models mentioned above to obtain embeddings
for tokens and sentences, fine-tuning is needed to further improve the model's performance, including:


• Attention. A natural idea to further fine-tune the representation of the arguments is to leverage the
attention mechanism between the plaintiff's argument and each of the five candidate arguments.

• RNN Layers. Note that after using the pretrained models, we have token-level, sentence-level as
well as sentence-pair-level representations (the representation of [CLS]). Therefore, we can retain the
sentence-pair-level representation, feed the tokens' embeddings into another BiLSTM layer, and
concatenate the two before the linear classifier.

• Memory Networks. All the methods mentioned above only use the information of the arguments.
However, we have provided the whole context of both sides in the judgement documents. Hence, it
is plausible to use memory networks [41] to retrieve the context information.

5.2.2 Promising Tricks


Other than the standard “pretrained model + fine-tuning” mode, there are some useful tricks which can 
address the issues met in the task and improve the sentence pair classification models significantly. We 
summarize them as follows: 


Fine-tuning with external corpus. Teams such as “zero point”, “quanshuizhihuiguan” and
“tiaodalanmao” all tried to fine-tune their pretrained models on an additional external judicial corpus.
Such a method helps because the external judicial corpus enables the pretrained language models to
learn more domain-specific language knowledge and therefore perform better in judicial settings. As
reported by these teams, this method increases accuracy by about 1%.


Data Augmentation and Data Balancing. The “a-U” team followed our way of constructing the
multiple choices and generated more multiple-choice questions for training by retrieving more negative
samples from the provided contexts of the defense, which helps the model to further leverage the context
information and incorporate more textual knowledge. Moreover, to address the problem of data imbalance
(too many negative samples), they over-sampled the positive instances to keep the model from being
overwhelmed by the number of negative samples.
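
Over-sampling of positive instances can be sketched as below. The paper does not state the ratio the team used, so balancing the two classes to 1:1 is an assumption, as are the helper name and the `(sc, bc, label)` tuple layout.

```python
import random


def oversample_positives(pairs, rng):
    """Duplicate random positive pairs until both classes have the same size.

    `pairs` is a list of (sc, bc, label) tuples with label 1 for interacting
    pairs and 0 otherwise (assumed layout).
    """
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    if not positives or len(positives) >= len(negatives):
        return list(pairs)
    extra = [rng.choice(positives)
             for _ in range(len(negatives) - len(positives))]
    balanced = list(pairs) + extra
    rng.shuffle(balanced)
    return balanced
```

Unlike discarding negatives, this keeps every negative example in the training set at the cost of repeating positives.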


Loss Function. Most models use cross entropy as their loss function. However, some models adopt
more promising loss functions, such as focal loss [42] to enhance the performance on low-frequency
categories, and triplet loss to improve the model's ability to generalize. Besides, the loss weights of
the categories and the activation function of the output layer also have great influence on the final
performance. As reported by the competitors, this approach transforms the task into an argument-pair
ranking problem rather than a classification problem, which yields an improvement of over 4%.
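
For reference, the two losses named above can be written for a single example in plain Python. These are the textbook forms under stated assumptions, not the participants' implementations: the focal loss omits the optional class-weighting factor, and the triplet loss is given in its score-based margin form with illustrative default hyperparameters.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example; p_true is the probability given to the gold class.

    With gamma = 0 this reduces to the usual cross entropy -log(p_true).
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def triplet_ranking_loss(score_pos, score_neg, margin=1.0):
    """Margin loss asking the positive candidate to outscore a negative by `margin`."""
    return max(0.0, margin - score_pos + score_neg)
```

The triplet form only constrains the relative order of candidates for the same plaintiff argument, which is what makes it a ranking objective rather than a per-pair classification objective.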


Model Ensembling. Some participants trained several different classification models over different
samples from the whole data set, and combined their predictions with majority voting or weighted
averaging. Among all the participants using such a method, the “a-U” team trained five sequence
classification models based on BERT and adopted majority voting to reduce the variance of a single
model and thus improve robustness, which helped them win the second prize of the competition.
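
Majority voting over several models' answers to the same problem can be sketched as follows. The tie-breaking rule (smallest answer index wins) is an assumption made here so that the result is deterministic; the team's report may have used a different rule.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the answer predicted by most models for one problem.

    `predictions` holds one predicted answer index per model.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    # Assumed tie-break: prefer the smallest answer index.
    return min(answer for answer, count in counts.items() if count == top)
```

With five voters, as in the ensemble described above, a single model's mistake is outvoted whenever at least three of the other models agree.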


5.2.3 Error Analysis 


Here, we inspect the erroneous outputs of our model to identify major causes of mismatches. There are 
mainly two issues. 


Sentence Length Limitation in Pretrained Models. Pretrained models like BERT have a maximum input
length: they truncate sentence pairs that exceed it, which makes the model unable to process all the
information contained in the sentence pair.
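
The effect of such truncation can be illustrated with a small sketch of the common longest-first strategy for sentence pairs. The limit of 512 tokens and the three special tokens ([CLS] and two [SEP]) are the usual BERT settings and are assumptions here, since the section does not say which truncation settings the models used.

```python
def truncate_pair(tokens_a, tokens_b, max_len=512, n_special=3):
    """Longest-first truncation of a sentence pair to fit a length budget.

    Tokens are dropped from the end of whichever sequence is currently longer
    until len(a) + len(b) + n_special <= max_len.
    """
    budget = max_len - n_special
    if budget < 0:
        raise ValueError("max_len is smaller than the number of special tokens")
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > budget:
        longer = a if len(a) > len(b) else b
        longer.pop()
    return a, b
```

Whatever is cut off is invisible to the model, so a decisive clause at the end of a long argument can be lost entirely.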

Entity Mismatch. Among the false cases, errors caused by entity mismatch are quite common. In
cases with multiple defendants, the plaintiff may raise different accusations against different
defendants. However, some of these accusations may share the same action mentioned by the plaintiff,
which confuses the model when a negative candidate argument contains the detailed action while the
positive one only includes a simple denial.


6. CONCLUSION AND FUTURE WORK 


In SMP-CAIL2020-Argmine, we employ the interactive argument-pair extraction in judgement document 
as the competition topic. In this competition, we construct and release a brand new data set for extracting 
the focus of dispute in the judgement documents. The performance on the task was significantly raised with 
the efforts of over 300 participants. In this paper, we summarize the general architecture and promising 
tricks they employed, which are expected to benefit further research on legal intelligence. However, there 
is still a long way to go to fully achieve the goal of automatically extracting the focus of dispute since the 
task is already a simplified one. Also, leveraging some more case-based features such as the type of case 
and type of crime and the semantic label of the interactive argument pairs may possibly further improve 
the model’s performance. 

APPENDIX A: FULL RANK OF ALL PARTICIPANTS


Full rank of all participants in SMP-CAIL2020-Argmine. $\mathrm{Score}_1$ and $\mathrm{Score}_2$ refer to
the scores achieved by the participants in phases I and II, respectively, while Final Score refers to
the weighted sum of $\mathrm{Score}_1$ and $\mathrm{Score}_2$.


Table A1. Full rank of all participants. 


Team Score_1 Score_2 Final Score
zero point 0.852 0.896 0.8828 
a-U 0.816 0.901 0.8755 
quanshuizhihuiguan 0.802 0.905 0.8741 
i 0.811 0.886 0.8635 
tiaodalanmao 0.800 0.857 0.8399 
wf 0.788 0.853 0.8335 
zhihuizhengfa 0.788 0.853 0.8335 
quanzhizhixing 0.789 0.852 0.8331 
bl ssk 0.787 0.852 0.8325 
xiaocuiwawa 0.785 0.852 0.8319 
fabaozhineng 0.796 0.847 0.8317 
fajixianzonghewozuodui 0.785 0.851 0.8312 
zhuimengzhizixin 0.794 0.847 0.8311 
CBD 0.779 0.853 0.8308 
xiaofa 0.777 0.852 0.8295 
xiaozhineng 0.775 0.840 0.8205 
testing 0.756 0.845 0.8183 
falvzhineng 0.763 0.841 0.8176 
boys 0.760 0.826 0.8062 
301deshuishou 0.768 0.810 0.7974 
SOS 0.755 0.797 0.7844 
qilejingtu 0.780 
TEEMO 0.780 
qweasd 0.774 
maitianxback 0.772 
duimingmeixianghao 0.771 
wisdom 0.768 
anonymous 0.768 
seu 0.768 
DN 0.768 
hongseyoujiaosanbeisu 0.768 
000 0.768 
OO 0.768 
zhangyuanyu 0.768 
daminghu 0.757 
zunjisoufa 0.757 
yunshujingjixue 0.755 
DL 0.753 
tiantianxiangshang 0.751 
jizhikekeyupipi 0.751 
ddlqianzuihouchongci 0.750 
Tracee 0.748 
zhineng 0.744 
heitu 0.736 
chong! 0.728 
nlpxiaoxuesheng 0.725 
sr 0.719 
huangjinkuanggong 0.714 
hello 0.708 
aaaa 0.706 
houchangcunbaoan 0.704 
EC_lab 0.680 
imiss 0.672 
nnnnO1 0.629 
zhegexiaohaiyoudiandou 0.598 
woshijiangdaqiao 0.520 
nlp11 0.517 
mushangdaren 0.491 
Eupho 0.491 
LawBoys 0.491 
xuexijishudui 0.491 
test11 0.472 
Iw 0.344 
amazing 0.083 